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We congratulate Samuel Kou, Qing Zhou and Wing Wong (re- 
ferred to subsequently as KZW) for this beautifully written paper, 
which opens a new direction in Monte Carlo computation. This dis- 
cussion has two parts. First, we describe a very closely related method, 
multicanonical sampling (MCS), and report a simulation example 
that compares the equi-energy (EE) sampler with MCS. Overall, we 
found the two algorithms to be of comparable efficiency for the sim- 
ulation problem considered. In the second part, we develop some 
additional convergence results for the EE sampler. 

1. A multicanonical sampling algorithm. Here, we take on KZW's dis- 
cussion about the comparison of the EE sampler and MCS. We compare 
the EE sampler with a general state-space extension of MCS proposed by 
Atchade and Liu [1]. We compare the two algorithms on the multimodal 
distribution discussed by KZW in Section 3.4. 

Let (X, B,X) be the state space equipped with its cr-algebra and ap- 
propriate measure, and let ir(x) oc e~ h ^ be the density of interest. Fol- 
lowing the notation of KZW, we let Hq < Hi < ■ ■ ■ < Hx e < Hk e +i = oo 
be a sequence of energy levels and let Dj = {x 6 X:h(x) G \Hj,Hj + \)}, 

<j < K e , be the energy rings. For x S X, define I(x) = j if x S Dj. Let 

1 = To < T\ < • ■ • < Tx t be a sequence of "temperatures." We use the no- 
tation fcW( x ) = e - h ^y T \ so that vrW(x) = k®( x )/fk®(x)\(dx). Clearly, 
7r(°) = tt. We find it more convenient to use the notation 7r^ instead of 7Tj 
as in KZW. Also note that we did not flatten tt^ as KZW did. 

The goal of our MCS method is to generate a Markov chain on the space 
X x {0, 1, . . . , Kt} with invariant distribution 

^ k^ix) 
vr(x, i) oc 22 1 D J (x)X(dx), 

j=0 Z *J 
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where Zij = / k^ l \x)li> j (x)\(dx). With a well-chosen temperature sequence 
(Tj) and energy levels (Hj), such a Markov chain would move very easily 
from any temperature level X x {i} to another. And inside each tempera- 
ture level X x {i}, the algorithm would move very easily from any energy 
ring Dj to another. Unfortunately, the normalizing constants Z%a are not 
known. They are estimated as part of the algorithm using the Wang-Landau 
recursion that we describe below. To give the details, we need a proposal ker- 
nel Qi(x,dy) = qi(x,dy)\(dy) on X, a proposal kernel A(i,j) on {0, ...,K t } 
and (7n), a sequence of positive numbers. We discuss the choice of these 
parameters later. 

Algorithm 1.1 (Multicanonical sampling). 

Initialization. Start the algorithm with some arbitrary (Xo,io) & X x 
{0, 1, . . . , K t }. For i = 0, ...,Kt, j = 0,...,K e we set all the weights to 

#(i) = i- 

Recursion. Given (X n ,t n ) = (x,i) and (4>n (j)), flip a #-coin. 

If Tail. Sample Y ~ Qi(x, •). Set X n+ i = Y with probability a^ l \x,Y); 
otherwise set X n+ \ = x, where 



(1.1) a^(x,y) = min 

Set t n +i = i. 



k^(y)^\l(x)) qi (y,x) 

G 1 



If Head. Sample j ~ A(i, •). Set t n+ \ = j with probability (3 x (i,j); oth- 
erwise set t n+ i = i, where 



k®{x) <Pn\l(x)) A(j,i) 



(1.2) (3 x (i,j)= mm 

Set X n+ i = x. 

Update the weights. Write (t n +i,I(X n+ i)) = (io?Jo)- Set 

(1-3) ^ ) 1 (j ) = 4 lo) (jo)(l+7n), 

leaving the other weights unchanged. 

If we choose Kt = in the algorithm above we obtain the MCS of [9] (the 
first MCS algorithm is due to [4]) and K e = gives the simulated tempering 
algorithm of [6]. The recursion (1.3) is where the weights Z^j are being 
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estimated. Note that the resulting algorithm is no longer Markovian. Under 
some general assumptions, it is shown in [1] that On\j) '■= 

J D . ^^(x) X(dx) as n — > oo. 

The MCS can be seen as a random-scan-Gibbs sampler on the two vari- 
ables (x,i) S X x {0, . . . ,Kt}, so the choice 9 = 1/2 for coin flipping works 
well. The proposal kernels Qi can be chosen as in a standard Metropolis- 
Hastings algorithm. But one should allow Qi to make larger proposal moves 
for larger i (i.e., hotter distributions). The proposal kernel A can be chosen 
as a random walk on {0, . . . ,Kt} (with reflection on and Kt). In our sim- 
ulations, we use A(0,1) = A(K t ,K t - 1) = 1, A(i,i- 1) = A(i,i + 1) = 1/2 
for i£{0,K t }. 

It can be shown that the sequence (6 n ) defined above follows a stochastic 
approximation with step size (pf n )- So choosing (j n ) is the same problem 
as choosing a step-size sequence in a stochastic approximation algorithm. 
We follow the new method proposed by Wang and Landau [9] where (7„) is 
selected adaptively. Wang and Landau's idea is to monitor the convergence of 
the algorithm and adapt the step size accordingly. We start with some initial 
value 70 and (j n ) is defined by j n = (1 + 7o) 1 ^ fc+1 ^ — 1 for 77 < n < Tk+i, 
where = To < T\ < • ■ ■ is a sequence of stopping times. Assuming Tj finite, 
Tj + i is the next time k > 7$ where the occupation measures (obtained from 
time Ti + 1 on) of all the energy rings in all the temperature levels are 
approximately equal. Various rules can be used to check that the occupation 
measures are approximately equal. Following [9], we check that the smallest 
occupation measure obtained is greater than c times the mean occupation, 
where c is some constant (e.g., c = 0.2) that depends on the complexity of 
the sampling problem. 

It is an interesting question to know whether this method of choosing the 
step-size sequence can be extended to more general stochastic approximation 
algorithms. A theoretical justification of the efficiency of the method is also 
an open question. 

2. Comparison of EE sampler and MCS. To use MCS to estimate inte- 
grals of interest such as M no (g(X)), one can proceed as KZW did by writing 
E W0 (g(X)) = J2j=oPj^Tr (giX)\X E Dj). Samples from the high-temperature 
chains can be used to estimate the integrals W, 7T0 (g(X)\X € Dj) by impor- 
tance reweighting in the same way as KZW did. In the case of MCS, the 

AO), ... 

probabilities pj = Pr no (X G Dj ) are estimated by pj = ^" ^ 



We compared the performances of the EE sampler and the MCS described 
above for the multimodal example in Section 3.4 of KZW. To make the two 
samplers comparable, each chain in the EE sampler was run for N iterations. 
We did the simulations for N = 10 4 , N = 5 x 10 4 and N = 10 x 10 4 . For the 
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Table 1 

Improvement of MCS over EE as given by 
(o'EE(ff) - 5-mc(s0)/o"mc(s) x 100 







E(Xi) 


E(X a ) 


E(X|) 


Pr»(l£B) 


N = 


10 4 


13.77 


12.53 


8.98 


-63.49 




5 x 10 4 


6.99 


-7.31 


-10.15 


-51.22 


N = 


10 5 


1.92 


5.79 


4.99 


-55.31 



*The comparisons are based on 100 replications of the samplers for 
each N. 



MC sampler, we used Kt = K e = K and the algorithm was run for (K + 
1) X N total iterations. We repeated each sampler for n = 100 iterations in 
order to estimate the finite sample standard deviations of the estimates they 
provided. Table 1 gives the improvements (in percentage) of MCS over EE 
sampling. Pi\(X 6 B) is the probability under tt of the union of all the discs 
with centers \i{ (the means of the mixture) and radius a/2. As we can see, 
when estimating global functions such as moments of the distribution, the 
two samplers have about the same accuracy with a slight advantage for MCS. 
But the EE sampler outperformed MCS when estimating Pr„-(X € B). The 
MCS is an importance sampling algorithm with a stationary distribution 
that is more widespread than ir. This may account for the better performance 
obtained by the EE sampler on ~Pr n (X £ B). More thorough empirical and 
theoretical analyses are apparently required to reach any firmer conclusions. 

3. Ergodicity of the equi-energy sampler. In this section we take a more 
technical look at the EE algorithm and derive some ergodicity results. First, 
we would like to mention that in the proof of Theorem 2, it is not clear to us 
how KZW derive the convergence in (5). Equation (5) implicitly uses some 
form of convergence of the distribution of Xn +1 ' to 7r^ +1 ^ as n — > oo and 
it is not clear to us how that follows from the assumption that Pr(X^i € 

A\X^ +1) = x) -> S( i+1 \x,A) asn^oo for all x, all A. 

In the analysis below we fix that problem, but under a more stringent 
assumption. To state our result, let {X,B) be the state space of each of the 
equi-energy chains. If Pi and P2 are two transition kernels on X, the product 
P1P2 is also a transition kernel defined as ^1^2(^5^) = / Pi( x ,dy)P2(y,A). 
Recursively, we define Pf as P} = P x and Pf = If / is a measurable 

real- valued function on X and fi is a measure on X, we denote Pf(x) := 
j P(x,dy)f(y) and fi(f) := / fi(dx)f(x). Also, for c€ (0, 00) we write |/| < c 
to mean < c for all x £ X. We define the following distance between 
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Pi and P 2 : 

(3.1) |||Pi - P 2 ||| := sup sup - P 2 f(x)\, 

xex |/|<i 

where the supremum is taken over all x 6 X and over all measurable func- 
tions / : X — > M with |/| < 1. We say that the transition kernel P is uniformly 
geometrically ergodic if there exists p € (0,1) such that 

(3.2) |||P n - 7r||| = <9(p n ). 

It is well known that (3.2) holds if and only if there exist e > 0, a non- 
trivial probability measure v and an integer m > 1 such that the so-called 
M(m, £, v) minorization condition holds, that is, P m (x,A) > eu(A) for all 

x G X and A^B (see, e.g., [8], Proposition 2). We recall that T^ H denotes 
the Metropolis-Hastings kernel in the EE sampler. The following result is 
true for the EE sampler. 

Theorem 3.1. Assume thatVi € {0, . . -,K}, T$ H satisfies a M(l,£i,7rW) 
minorization condition and that condition (hi) of Theorem 2 of the paper 
holds. Then for any bounded measurable function f , as re — > oo ; 

1 n 

(3.3) E[/(*W)]— ttW(/) and ±£> *«(/). 

For example, if A? is a compact space and e~ h ^ x ' remains bounded away 
from and oo, then (3.3) holds. Note that the ith chain in the EE sam- 
pler is actually a nonhomogeneous Markov chain with transition kernels 
K$ , where k£> (x, A) := Pt[X® +1 £ A\X^ ] =x].ks pointed out by 

KZW, for any x € X and AeB, K$(x,A) S®(x,A) as re — > oo, where 
is the limit transition kernel in the EE sampler. This setup brings to 
mind the following convergence result for nonhomogeneous Markov chains 
(see [5], Theorem V.4.5): 

Theorem 3.2. Let P, Pq, P±, . . . be a sequence of transition kernels on 
(X, B) such that \\\P n — P\\\ — » and P is uniformly geometrically ergodic with 
invariant distribution n. Then the Markov chain with transition kernels (Pi) 
is strongly ergodic; that is, for any initial distribution fi, 

(3.4) |||/xiyV--Pf»-7r|->0 asiwoo. 

The difficulty in applying this theorem to the EE sampler is that we do 
not have \IK$ — S^\\\ — ► but only a setwise convergence \K„ (x, A) — 
S®(x,A)\ -» for each x G X, A 6 B. The solution we propose is to extend 
Theorem 3.2 as follows. 
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Theorem 3.3. Let P, Pq, P%, . . . be a sequence of transition kernels on 
(X,B) such that: 

(i) For any x 6 X and A& B, P n (x, A) — ► P(x, A) as n — > oo. 

(ii) P has invariant distribution ir and P n has invariant distribution ix n . 
There exists p € (0, 1) such that \\P k - 7r||| = 0(p k ) and \\\P k - 7r„||| = 0(p k ). 

(iii) |||P n - P n _i HI < 0(n~ x ) for some A > 0. 

Then, if (X n ) is an X -valued Markov chain with initial distribution p and 
transition kernels (P n ), for any bounded measurable function f we have 

1 n 

(3.5) E[/(X n )]— »7r(/) and - £ f(X k ) ^ tt(/) as m oo. 

We believe that this result can be extended to the more general class of 
^-geometrically ergodic transition kernels and then one could weaken the 

(i) 

uniform minorization assumption on TjVjj in Theorem 3.1. But the proof will 
certainly be more technical. We now proceed to the proofs of the theorems. 
We first prove Theorem 3.3 and use it to prove Theorem 3.1. 

Proof of Theorem 3.3. It can be easily shown from (ii) that ||7T n — 
To-ill < jr^lll-fn — Pn-i|||- Therefore, Theorems 3.1 and 3.2 of [2] apply and 
assert that for any bounded measurable function /, E[f(X n ) — vr n (/)] — > 
and ^ Yjc=i\f(Xk) — 7r fc(/)] ^ as n — > oo. To finish, we need to prove that 
ffn(f) ~~ * 7r (/) as n ^ oo. To this end, we need the following technical lemma 
proved in [7], Chapter 11, Proposition 18. 

Lemma 3.1. Let (/„) be a sequence of measurable functions and let 
p,pi, ... be a sequence of probability measures such that \ f n \ < 1 and f n —*f 
pointwise [f n (%) —* f(x) for all x 6 X] and p n ^> p setwise [p n (A) — > p(A) 
for all A€B\. Then f f n {x)p n {dx) — > f f{x)p{dx). 

Here is how to prove that vr n (/) — ► n(f) as in oo. By (i), we have 
Pnf{x) — > Pf{x) for all x £ X. Then, by (i) and Lemma 3.1, P%f(x) = 
Pn{Pnf){x) — > P 2 f(x) as n — ► oo. By recursion, for any x € X and k > 1, 
P k f(x) — > P k f(x) as n — > oo. Now, write 

k„(/) - 7T(/)| < K(f) - P k f{ X )\ + \P k f( X ) - P k f{ X )\ 

(3.6) +|pV(x)-vt(/)| 

< 2p k sup |/(x)| + \P k f(x) - P k f(x)\ [by (ii)]. 
xex 

Since \P k f{x) - P k f{x)\ -» 0, we see that \it n {f) - vr(/)| -» 0. □ 
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Proof of Theorem 3.1. Let (fl,.F,P) be the probability triplet on 
which the equi-energy process is defined and let E be its expectation opera- 
tor. The result is clearly true by assumption for i = K. Assuming that it is 
true for the (i + l)st chain, we will prove it for the ith. chain. 

The random process (Xn ^ ) is a nonhomogeneous Markov chain with tran- 
sition kernel Kn\x,A) := PrpT^L € A\X$ = x\. For any bounded measur- 
able function /, operates on / as follows: 



K®f{x) = (1 -p ee )T^ i f(x)+p ee E[R^f(x)}, 
where R$ f(x) is a ratio of empirical sums of the (i + l)st chain of the form 



L,k=-N L D I(x) > 

(3.7) 

L,k=-N L D I(x) {* k ) 

[take R^f(x) = and p cc = when Y,k=-N 1 D I(x) (X { k i+1) ) = 0], where a®(x, 

is the acceptance probability min[l, ^iyM^Tiyfa) ] • ^ * s ^ ow ^ on S the + ^) s * 
chain has been running before the ith chain started. Because (3.3) is assumed 
true for the (i + l)st chain and condition (iii) of Theorem 2 of the paper 

holds, we can assume in the sequel that J2k=-N^-D I{x) {X k +1 ^) > 1 for all 
n > 1. We prove the theorem through a series of lemmas. 

Lemma 3.2. For the EE sampler, assumption (i) of Theorem 3.3 holds 
true. 

Proof. Because (3.3) is assumed true for the (i + l)st chain, the strong 
law of large numbers and Lebesgue's dominated convergence theorem apply 
to R { n ] f{x) and assert that for all x £ X and A 6 B, K$(x,A) -> S®(x,A) 
as n -> oo, where 5 (l) (x, A) = (l-poc)^^^, A)+p cc Ejio r EE ( x ) 

(i 7') 

where is the transition kernel of the Metropolis-Hastings with proposal 
distribution 

^ i+1 Hy)l Dj (y)/pf +1) 

and invariant distribution 

7T^(x)l Dj (x)/ P f\ □ 



Lemma 3.3. For the EE sampler, assumption (ii) of Theorem 3.3 holds. 
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(i) (i) 

Proof. Clearly, the minorization condition on 2V™ transfers to K n . 

(i) . (i) 

It then follows that each Kn admits an invariant distribution 7Tn and is 
uniformly geometrically ergodic toward 7Tn with a rate pi = 1 — (1 — p ce )£i- 
The limit transition kernel S^> in the EE sampler as detailed above has 
invariant distribution ir^ and also inherits the minorization condition on 

T-iO) r-i 
MH' LJ 



Lemma 3.4. For the EE sampler, assumption (hi) of Theorem 3.3 holds 
true with A = 1. 

Proof. Any sequence (x n ) of the form x n = ^Atl 1 — ^— ^ can always be 
written recursively as x n = x n _i + ^ °" — (u n — Using this, we easily 

have the bound 

\K^f(x)-K { : ) _ 1 f(x)\<2E 



for all x € X and |/| < 1. Therefore, the lemma will be proved if we can 
show that 



(3.8) 



sup E 

0<j<K 



n 



O(l). 



To do so, we fix j € {0, ... , K} and take e £ (0, 5), where 5 = (1 — p cc )ei + \ x 
7r( i+1 )(Dj) >0. We have 



E 



n 



u=- N id, (4 m) : 



(3.9) 



E 



(i+l) \ (EL-iV ^ iX^ +1) )>n(d-e)} 



+ E 



?1 



The first term on the right-hand side of (3.9) is bounded by 1/(6 — e). 
The second term is bounded by 



n Pr 



(3.10) 



E iD J (4 m) ) + E(^(4 m) )-^)<-^ 

=-N fc=l 
<nPr[Af£ +1) >ne], 
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where mH +1) = Y%=1 K k-\ ] ( X k~i ] > D j) ~ 1 D,{xj! +1) ). For the inequality 
in (3.10), we use the minorization condition K^^\x, Dj) > 5. Now, the se- 
quence is a martingale with increments bounded by 1. By Azuma's 
inequality ([3], Lemma 1), we have nPr[M^ +1) > ne] < nexp(— ne 2 /2) — ► 
as n — > co. □ 



Theorem 3.1 now follows from Theorem 3.3. □ 
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